GH-38837: [Format] Add the specification for statistics schema #45058

Merged
merged 10 commits into from
Dec 31, 2024

Member

@kou kou commented Dec 18, 2024

Rationale for this change

Statistics are useful for fast query processing. Many query engines
use statistics to optimize their query plan.

The Apache Arrow format doesn't have statistics, but other formats that can
be read as Apache Arrow data may have them. For example, Apache
Parquet C++ can read an Apache Parquet file as Apache Arrow data, and
an Apache Parquet file may have statistics.

One of the Apache Arrow C streaming interface use cases is the following:

  1. Module A reads an Apache Parquet file as Apache Arrow data
  2. Module A passes the read Apache Arrow data to module B through the
    Arrow C data interface
  3. Module B processes the passed Apache Arrow data

If module A can pass the statistics associated with the Apache Parquet
file to module B, module B can use the statistics to optimize its
query plan.
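
As a rough illustration of that last point, the sketch below shows how module B might use statistics to avoid work. It is plain Python rather than Arrow code: `can_skip_scan` and the dictionary layout are hypothetical, and only the statistic names (`ARROW:min_value:exact`, `ARROW:max_value:exact`) are taken from the discussion in this PR.

```python
# Hypothetical per-column statistics that module A could hand to module B.
# The statistic names follow the "ARROW:<name>:exact" style used in this PR;
# the dict layout itself is an assumption made for illustration.
column_stats = {
    "ARROW:min_value:exact": 1,
    "ARROW:max_value:exact": 5,
}


def can_skip_scan(stats, lower_bound):
    """Return True if no row can satisfy `value > lower_bound`."""
    max_value = stats.get("ARROW:max_value:exact")
    # Without an exact maximum nothing can be proven, so the scan must run.
    if max_value is None:
        return False
    return max_value <= lower_bound


# Planning `WHERE value > 10`: the exact maximum is 5, so the scan can be skipped.
skip_large = can_skip_scan(column_stats, 10)
# Planning `WHERE value > 3`: some rows may match, so the scan is still needed.
skip_small = can_skip_scan(column_stats, 3)
```

A real query engine would apply the same idea per file or per row group, and would only trust statistics marked as exact for this kind of pruning.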

What changes are included in this PR?

We standardize how to represent statistics as an Apache Arrow array
so that they are easy to exchange.

We don't standardize how to pass the statistics array. You can use any
interface for it. For example, you can use the Apache Arrow C data interface.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou
Member Author

kou commented Dec 18, 2024

@github-actions crossbow submit preview-docs

⚠️ GitHub issue #43553 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-43553: [Format] Add the specification for statistics schema GH-38837: [Format] Add the specification for statistics schema Dec 18, 2024
Revision: 5308f9f

Submitted crossbow builds: ursacomputing/crossbow @ actions-a2d5a054aa

Task Status
preview-docs GitHub Actions

@kou
Member Author

kou commented Dec 18, 2024

Member

@paleolimbot paleolimbot left a comment

Thank you for continuing to work on this! A few optional thoughts on TODOs.

can access to proper field by type code not name. So we can use
any valid name for fields.

TODO: Should we standardize field names?

Member

I don't see any reason to standardize the names, but a reason I could see to use explicit type IDs for at least a few commonly used statistic types would be to ensure that an ArrowArray (or standalone RecordBatch message) could be interpreted without an ArrowSchema. (Completely optional!)
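
For readers unfamiliar with why names are not needed here, the following is a simplified Python model of how a dense union picks a child by type code. It is not an Arrow API: the type codes 0 and 1, the child contents, and `union_value` are all assumptions chosen for illustration.

```python
# Minimal model of a dense union: each slot has a type code and an offset
# into the child array selected by that code. Field names are deliberately
# absent; they are not needed to read the values.
children = {
    0: [0, 2],  # child for type code 0 (for example, int64 values)
    1: [1.5],   # child for type code 1 (for example, float64 values)
}
type_codes = [0, 1, 0]  # which child each slot uses
offsets = [0, 0, 1]     # position inside that child


def union_value(slot):
    """Look up the value of one union slot by type code and offset."""
    return children[type_codes[slot]][offsets[slot]]


values = [union_value(i) for i in range(len(type_codes))]
```

If producers agreed on fixed type codes for common statistic value types, a consumer could decode the values with a table like `children` without consulting the schema, which is the scenario the comment above describes.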

Member Author

There were no objections to leaving field names unstandardized.

So I won't standardize field names.

- Nullable
- Notes
* - key
- ``dictionary<indices: int32, dictionary: utf8>``

Member

The only reason I can see why this would be problematic is that the statistics values would require more than one IPC message to represent. (Completely optional: this may not be an important consideration!)
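
To make the concern concrete: with dictionary encoding, each distinct statistic name is stored once and every row refers to it by an integer index. In the Arrow IPC format the dictionary is sent in its own message, separate from the record batch that holds the indices. The sketch below models only the encoding step in plain Python; `dictionary_encode` is a hypothetical helper, not an Arrow API.

```python
def dictionary_encode(names):
    """Return (dictionary, indices) for a list of strings."""
    dictionary = []   # each distinct name, in first-seen order
    positions = {}    # name -> index into `dictionary`
    indices = []      # one index per input element
    for name in names:
        if name not in positions:
            positions[name] = len(dictionary)
            dictionary.append(name)
        indices.append(positions[name])
    return dictionary, indices


keys = [
    "ARROW:row_count:exact",
    "ARROW:null_count:exact",
    "ARROW:distinct_count:exact",
    "ARROW:null_count:exact",
    "ARROW:distinct_count:exact",
]
dictionary, indices = dictionary_encode(keys)
```

Here five keys collapse to three dictionary entries plus five small integers, which is the space saving that motivates the dictionary type for the key field.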

Member Author

Ah, I hadn't noticed that point. Thanks.

The currently proposed specification doesn't target any particular transport, protocol, or API, so this may not be a problem. If this representation doesn't fit our IPC formats, users can simply not use it.

Contributor

@paleolimbot it's hard to shoehorn auxiliary data to the existing protocols. By using Arrow Arrays themselves to push statistics we at least don't have to introduce another serialization format (like protobuf, JSON [how this proposal started], more flatbuffers...).

Member

My point was just that introducing a slight restriction would probably cut the IPC serialized size in half (or more) and/or reduce the amount of marshalling required in the C Data interface case. (It doesn't sound like this is a concern for anybody and I don't really mind either way).

- The maximum size in bytes of a row in the target
column. (exact)
* - ``ARROW:max_byte_width:approximate``
- ``float64``: TODO: Should we use ``int64`` instead?

Member

The float64ness makes sense to me here because the calculation providing this approximate value almost certainly returns a non-exact value (i.e., not an integer, even though the exact value is definitely an integer).
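
The same reasoning applies to any estimated statistic. As a hedged illustration, the sketch below extrapolates a null count from a sample; a null count is used instead of a byte width only because it is easier to follow. The estimator `estimate_null_count` is hypothetical and is not part of the specification.

```python
def estimate_null_count(sample, total_rows):
    """Scale the null fraction of a sample up to the full column."""
    nulls_in_sample = sum(1 for value in sample if value is None)
    return nulls_in_sample / len(sample) * total_rows


sample = [1, None, 3, None, 5, None, 7]  # 3 nulls out of 7 sampled rows
estimate = estimate_null_count(sample, total_rows=100)
# The exact null count is an integer, but the estimate (about 42.86) is not.
```

Storing such an estimate in an integer field would force the producer to round, which discards information the consumer might want.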

Member Author

There were no objections to using float64.

So I'll use float64.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting committer review Awaiting committer review labels Dec 18, 2024
@kou
Member Author

kou commented Dec 19, 2024

Thanks for sharing your opinions!
I should have added links to the related discussions to the TODOs. I'll add them.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting changes Awaiting changes labels Dec 19, 2024
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Dec 23, 2024
@kou
Member Author

kou commented Dec 23, 2024

@github-actions crossbow submit preview-docs

@github-actions github-actions bot added the awaiting change review Awaiting change review label Dec 23, 2024
Revision: 195c374

Submitted crossbow builds: ursacomputing/crossbow @ actions-f825363132

Task Status
preview-docs GitHub Actions

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 23, 2024
@kou
Member Author

kou commented Dec 23, 2024

Contributor

@felipecrv felipecrv left a comment

Some suggestions. Please double-check them.

I'm not sure if my suggestion to change the way we describe the dictionary type makes sense, but I find dictionary<values=.., indices=..> much less confusing than dictionary<indices=.., dictionary=...>.

I think "statistics key" should be referred to as "statistics name".

docs/source/format/StatisticsSchema.rst Outdated Show resolved Hide resolved
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Dec 24, 2024
@github-actions github-actions bot added the awaiting change review Awaiting change review label Dec 24, 2024
@kou
Member Author

kou commented Dec 24, 2024

@github-actions crossbow submit preview-docs

@kou
Member Author

kou commented Dec 24, 2024

Thanks for your suggestions!
All of them make sense. I've applied all of them and adjusted other places based on the suggestions.

Revision: 3b9c918

Submitted crossbow builds: ursacomputing/crossbow @ actions-ce837e4f33

Task Status
preview-docs GitHub Actions

Contributor

@alamb alamb left a comment

I left a review on an early draft of this: #43553 (review)

In general I think this approach seems well reasoned and thought out to me -- while there are other potential implementations I think this specification is very reasonable and understandable.

Nice work @kou @paleolimbot and @felipecrv ❤️

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Dec 24, 2024
Contributor

@alamb alamb left a comment

I also think this information would be straightforward to convert to/from the structure we use for Statistics in DataFusion https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.Statistics.html

@github-actions github-actions bot added awaiting merge Awaiting merge awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting merge Awaiting merge labels Dec 24, 2024
Co-authored-by: Felipe Oliveira Carvalho <[email protected]>
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Dec 24, 2024
>
>

Statistics array::

Member Author

Copied from: https://github.com/apache/arrow/pull/43553/files#r1896750644

I didn't understand this example. I thought the statistics were structs, so I would have expected the data to look something like this (perhaps we could give the "logical contents" and then the specific array encoding):

[
  // first struct element
  {
    column: null, // record batch
    statistics: {
      "ARROW:row_count:exact": 0
    }
  },
  {
    column: 0, // vendor_id
    statistics: {
      "ARROW:null_count:exact": 0,
      "ARROW:distinct_count:exact": 2,
      "ARROW:max_value:exact": 5,
      "ARROW:min_value:exact": 1,
    }
  },
  ...
]

I can help work out the example if people think this is a good idea

Member Author

Ah, that makes sense.
I used the physical (columnar) representation here, but the logical (row-based) representation may be easier to understand.

How about showing both of them? (We keep the current representation and add a row-based representation like you suggested.)
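
To show how the two views relate, here is a minimal Python sketch that flattens the row-based form into a columnar one. It assumes the statistics of each row are stored as a map-like run of key/value pairs addressed by offsets; the plain lists stand in for Arrow buffers, and `to_columnar` is a hypothetical helper. In the proposed schema the keys would additionally be dictionary-encoded and the values held in a union, both of which are omitted here.

```python
# Logical (row-based) view, taken from the example in the review comment.
rows = [
    {"column": None, "statistics": {"ARROW:row_count:exact": 0}},
    {"column": 0, "statistics": {
        "ARROW:null_count:exact": 0,
        "ARROW:distinct_count:exact": 2,
        "ARROW:max_value:exact": 5,
        "ARROW:min_value:exact": 1,
    }},
]


def to_columnar(rows):
    """Flatten row-based statistics into parallel columnar lists."""
    columns = []   # one entry per row; None means "the record batch itself"
    offsets = [0]  # row i owns keys/values in offsets[i]:offsets[i + 1]
    keys = []
    values = []
    for row in rows:
        columns.append(row["column"])
        for name, value in row["statistics"].items():
            keys.append(name)
            values.append(value)
        offsets.append(len(keys))
    return {"column": columns, "offsets": offsets, "keys": keys, "values": values}


physical = to_columnar(rows)
```

Reading row 1 back means slicing `keys` and `values` with `offsets[1]:offsets[2]`, which is why the columnar form is harder to scan by eye than the row-based one.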

@kou
Member Author

kou commented Dec 24, 2024

@alamb Thanks for reviewing GH-43553!

I replied to all of your comments in GH-43553, but I copied the #45058 (comment) discussion here because I think we should keep discussing it.

@kou
Member Author

kou commented Dec 31, 2024

The vote carried: https://lists.apache.org/thread/yp8gqkv74zchtv55jfqojntw025wrslq

I'll merge this.

@kou kou merged commit 1e45e18 into apache:main Dec 31, 2024
10 checks passed
@kou kou removed the awaiting changes Awaiting changes label Dec 31, 2024
@kou kou deleted the docs-statistics-array branch December 31, 2024 08:11
4 participants